11 research outputs found
Heavy-tailed Independent Component Analysis
Independent component analysis (ICA) is the problem of efficiently recovering
a matrix $A \in \mathbb{R}^{n \times n}$ from i.i.d. observations of $X = AS$,
where $S \in \mathbb{R}^n$ is a random vector with mutually independent
coordinates. This problem has been intensively studied, but all existing
efficient algorithms with provable guarantees require that the coordinates
$S_i$ have finite fourth moments. We consider the heavy-tailed ICA problem,
where we do not make this assumption, even about the second moment. This problem
also has received considerable attention in the applied literature. In the
present work, we first give a provably efficient algorithm that works under the
assumption that for constant $\gamma > 0$, each $S_i$ has finite
$(1+\gamma)$-moment, thus substantially weakening the moment requirement
condition for the ICA problem to be solvable. We then give an algorithm that
works under the assumption that matrix $A$ has orthogonal columns but requires
no moment assumptions. Our techniques draw ideas from convex geometry and
exploit standard properties of the multivariate spherical Gaussian distribution
in a novel way.
Comment: 30 pages
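The ICA observation model described above can be illustrated with a minimal sketch. It is not the algorithm from the paper; it only generates data from a heavy-tailed instance. The symmetric Pareto sources, the shape parameter `alpha = 1.5` (finite first moment, infinite variance), the 2x2 mixing matrix, and the helper names `sample_sources` and `mix` are all assumptions chosen for illustration.

```python
import random

def sample_sources(n, alpha, rng):
    """Draw n independent symmetric Pareto-type coordinates.

    E|S_i|^k is finite only for k < alpha, so alpha = 1.5 gives a finite
    first moment but an infinite second moment (assumed example).
    """
    return [rng.choice((-1.0, 1.0)) * rng.paretovariate(alpha) for _ in range(n)]

def mix(A, s):
    """Return the observation x = A s for a matrix A given as a list of rows."""
    return [sum(a_ij * s_j for a_ij, s_j in zip(row, s)) for row in A]

rng = random.Random(0)
A = [[2.0, 1.0], [1.0, 3.0]]  # hypothetical mixing matrix
samples = [mix(A, sample_sources(2, 1.5, rng)) for _ in range(1000)]
```

An ICA algorithm would receive only `samples` and try to recover `A` up to the usual ambiguities of the problem; with infinite-variance sources, the empirical covariance of `samples` does not stabilize as the sample size grows.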
Non-Euclidean Differentially Private Stochastic Convex Optimization
Differentially private (DP) stochastic convex optimization (SCO) is a
fundamental problem, where the goal is to approximately minimize the population
risk with respect to a convex loss function, given a dataset of i.i.d. samples
from a distribution, while satisfying differential privacy with respect to the
dataset. Most of the existing works in the literature of private convex
optimization focus on the Euclidean (i.e., $\ell_2$) setting, where the loss is
assumed to be Lipschitz (and possibly smooth) w.r.t. the $\ell_2$ norm over a
constraint set with bounded $\ell_2$ diameter. Algorithms based on noisy
stochastic gradient descent (SGD) are known to attain the optimal excess risk
in this setting.
In this work, we conduct a systematic study of DP-SCO for $\ell_p$-setups.
For $p = 1$, under a standard smoothness assumption, we give a new algorithm with
nearly optimal excess risk. This result also extends to general polyhedral
norms and feasible sets. For $p \in (1, 2)$, we give two new algorithms, whose
central building block is a novel privacy mechanism, which generalizes the
Gaussian mechanism. Moreover, we establish a lower bound on the excess risk for
this range of $p$, showing a necessary dependence on $\sqrt{d}$, where $d$ is
the dimension of the space. Our lower bound implies a sudden transition of the
excess risk at $p = 1$, where the dependence on $d$ changes from logarithmic to
polynomial, resolving an open question in prior work [TTZ15]. For
$p \in (2, \infty)$, noisy SGD attains optimal excess risk in the
low-dimensional regime; in particular, this proves the optimality of noisy SGD
for $p = \infty$. Our work draws upon concepts from the geometry of normed
spaces, such as the notions of regularity, uniform convexity, and uniform
smoothness.
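As a rough illustration of the Euclidean baseline mentioned above, the following sketch shows one step of noisy SGD: the gradient is clipped in Euclidean norm and perturbed with Gaussian noise before the update. It is not one of the new algorithms from the paper. The function names, the step size, the clipping threshold, and the noise scale `sigma` are assumptions; calibrating `sigma` to a target privacy level and projecting onto the constraint set are omitted.

```python
import math
import random

def clip_l2(vec, max_norm):
    """Scale vec down so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def noisy_sgd_step(w, grad, lr, clip, sigma, rng):
    """One gradient step with clipping and additive Gaussian noise.

    sigma is assumed to be supplied by the caller; its calibration to a
    privacy budget is not shown here.
    """
    g = clip_l2(grad, clip)
    noisy = [gi + rng.gauss(0.0, sigma) for gi in g]
    return [wi - lr * ni for wi, ni in zip(w, noisy)]

rng = random.Random(0)
w_next = noisy_sgd_step([1.0, 1.0], [3.0, 4.0], lr=0.5, clip=1.0, sigma=0.1, rng=rng)
```

Clipping bounds how much any single sample can change the update, which is what allows noise of a fixed scale to mask its contribution.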
A new over-dispersed count model
A new two-parameter discrete distribution, namely the PoiG distribution, is
derived by the convolution of a Poisson variate and an independently
distributed geometric random variable. This distribution generalizes both the
Poisson and geometric distributions and can be used for modelling
over-dispersed as well as equi-dispersed count data. A number of important
statistical properties of the proposed count model are derived, such as the
probability generating function, the moment generating function, the moments,
the survival function and the hazard rate function. Monotonic properties, such
as log-concavity and stochastic ordering, are also investigated in
detail. Method-of-moments and maximum likelihood estimators of the
parameters of the proposed model are presented. It is envisaged that the
proposed distribution may prove to be more useful to practitioners for
modelling over-dispersed count data than its closest competitors.
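The convolution construction can be sketched numerically as follows. The parameterization is an assumption and may differ from the one used in the paper: here `lam` is the Poisson mean and the geometric variable counts failures before the first success, with success probability `p` and support {0, 1, 2, ...}. Under that assumption the mean is lam + (1 - p)/p and the variance is lam + (1 - p)/p^2, so the variance is never below the mean.

```python
import math

def poisson_pmf(j, lam):
    """Poisson pmf, computed in log space to avoid overflow for large j."""
    return math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))

def geometric_pmf(m, p):
    """Geometric pmf on {0, 1, 2, ...}: failures before the first success."""
    return p * (1.0 - p) ** m

def poig_pmf(k, lam, p):
    """P(X + Y = k) for independent X ~ Poisson(lam) and Y ~ Geometric(p)."""
    return sum(poisson_pmf(j, lam) * geometric_pmf(k - j, p) for j in range(k + 1))

lam, p = 2.0, 0.4            # hypothetical parameter values
support = range(200)         # truncation point; the tail beyond it is negligible here
probs = [poig_pmf(k, lam, p) for k in support]
mean = sum(k * q for k, q in zip(support, probs))
var = sum((k - mean) ** 2 * q for k, q in zip(support, probs))
```

With these values the closed forms give a mean of 3.5 and a variance of 5.75, which the truncated sums reproduce to within numerical error.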
A genome-wide association study identifies risk alleles in plasminogen and P4HA2 associated with giant cell arteritis
Giant cell arteritis (GCA) is the most common form of vasculitis in individuals older than 50 years in Western countries. To shed light on the genetic background influencing susceptibility to GCA, we performed a genome-wide association screening in a well-powered study cohort. After imputation, 1,844,133 genetic variants were analysed in 2,134 cases and 9,125 unaffected controls from ten independent populations of European ancestry. Our data confirmed HLA class II as the strongest associated region (independent signals: rs9268905, P = 1.94E-54, per-allele OR = 1.79; and rs9275592, P = 1.14E-40, OR = 2.08). Additionally, PLG and P4HA2 were identified as GCA risk genes at the genome-wide level of significance (rs4252134, P = 1.23E-10, OR = 1.28; and rs128738, P = 4.60E-09, OR = 1.32, respectively). Interestingly, we observed that the association peaks overlapped with different regulatory elements related to cell types and tissues involved in the pathophysiology of GCA. PLG and P4HA2 are involved in vascular remodelling and angiogenesis, suggesting a high relevance of these processes for the pathogenic mechanisms underlying this type of vasculitis.
Heavy-Tailed Analogues of the Covariance Matrix for ICA
Independent Component Analysis (ICA) is the problem of learning a square matrix A, given samples of X = AS, where S is a random vector with independent coordinates. Most existing algorithms are provably efficient only when each S_i has finite and moderately valued fourth moment. However, there are practical applications where this assumption need not be true, such as speech and finance. Algorithms have been proposed for heavy-tailed ICA, but they are not practical, using random walks and the full power of the ellipsoid algorithm multiple times. The main contributions of this paper are: (1) A practical algorithm for heavy-tailed ICA that we call HTICA. We provide theoretical guarantees and show that it outperforms other algorithms in some heavy-tailed regimes, both on real and synthetic data. Like the current state-of-the-art, the new algorithm is based on the centroid body (a first moment analogue of the covariance matrix). Unlike the state-of-the-art, our algorithm is practically efficient. To achieve this, we use explicit analytic representations of the centroid body, which bypass the use of the ellipsoid method and random walks. (2) We study how heavy tails affect different ICA algorithms, including HTICA. Somewhat surprisingly, we show that some algorithms that use the covariance matrix or higher moments can successfully solve a range of ICA instances with infinite second moment. We study this theoretically and experimentally, with both synthetic and real-world heavy-tailed data.